PUBLIC DRAFT -- HTML
HYPERTEXT MARKUP LANGUAGE
A REPRESENTATION FOR NODES IN THE WORLD WIDE WEB
Daniel W. Connolly, Convex Computer Corp.
January, 1993
Status of this Document
Distribution of this document is unlimited. Please send comments to Dan
Connolly <connolly@convex.com>.
Abstract
The World Wide Web project involves the processing of structured documents
by diverse systems around the globe. Existing document representations
geared towards typesetting, information retrieval, or multimedia are too
tightly coupled to a hardware system, authoring environment, publication
style, or field of study.
HyperText Markup Language was created to fill the need to
Represent existing bodies of information
Connect information entities with hypertext links
Scale to a world-wide scope
Fit into existing and evolving user interface paradigms
Provide an experimental platform for collaborative hypermedia
Contents
Introduction
Structured Text
Tags
Element Types
Comments and Other Markup
Line Breaks
Summary of Markup Signals
HTML semantics @@
Rationale @@
References
HTML DTD
INTRODUCTION
The HyperText Markup Language is defined in terms of the ISO Standard
Generalized Markup Language (SGML, ISO 8879:1986). SGML is a system for
defining structured document types and markup languages to represent
instances of those document types.
Every SGML document has three parts:
An SGML declaration, which binds SGML processing quantities and syntax
token names to specific values. For example, the SGML declaration in the
HTML DTD specifies that the string that opens an end tag is </ and the
maximum length of a name is 34 characters.
A prologue including one or more document type declarations, which
specify the element types, element relationships and attributes, and
references that can be represented by markup. The HTML DTD specifies, for
example, that the HEAD element contains at most one TITLE element.
An instance, which contains the data and markup of the document.
We use the term HTML to mean both the document type and the markup language
for representing instances of that document type.
All HTML documents share the same SGML declaration and prologue. Hence
implementations of the World Wide Web generally only transmit and store the
instance part of an HTML document. To construct an SGML document entity for
processing by an SGML parser, it is necessary to prefix the text of the
``HTML DTD'' section to the HTML instance.
Conversely, to implement an HTML parser, one need only implement those parts
of an SGML parser that are needed to parse an instance after parsing the
HTML DTD.
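As a rough illustration of the construction described above, the following Python sketch prefixes the shared declaration and prologue to a stored instance. The function name make_document_entity and the placeholder string HTML_DTD_TEXT are hypothetical; a real system would substitute the full text of the HTML DTD section.

```python
def make_document_entity(dtd_text: str, instance: str) -> str:
    """Prefix the shared SGML declaration and prologue to an HTML instance."""
    # Separate the two parts with one newline so the last DTD token cannot
    # run into the first character of the instance.
    return dtd_text.rstrip("\n") + "\n" + instance


# Placeholder standing in for the real declaration and prologue (assumption).
HTML_DTD_TEXT = '<!SGML "ISO 8879:1986" ... >\n<!DOCTYPE HTML [ ... ]>'

entity = make_document_entity(HTML_DTD_TEXT, "<TITLE>Sample</TITLE>\n")
```

The result is what would be handed to a general SGML parser; an HTML-only parser can skip this step and read the instance directly.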
STRUCTURED TEXT
An HTML instance is like a text file, except that some of the characters are
interpreted as markup. The markup gives structure to the document.
The instance represents a hierarchy of elements. Each element has a name,
some attributes, and some content. Most elements are represented in the
document as a start tag, which gives the name and attributes, followed by
the content, followed by the end tag. For example:

    <HTML>
    <TITLE> A sample HTML instance </TITLE>
    <H1> An Example of Structure </H1>
    Here's a typical paragraph.
    <P>
    <UL>
    <LI> Item one has an <A NAME="anchor"> anchor </A>
    <LI> Here's item two.
    </UL>
    </HTML>

Some elements (e.g. P, LI) are empty. They have no content. They show up
as just a start tag.
For the rest of the elements, the content is a sequence of data characters
and nested elements.
Tags
Every element starts with a tag, and every non-empty element ends with a
tag. Start tags are delimited by < and >, and end tags are delimited
by </ and >.
NAMES
The element name immediately follows the tag open delimiter. Names consist
of a letter followed by up to 33 letters, digits, periods, or hyphens. Names
are not case sensitive.
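The name rule can be expressed as a regular expression. The sketch below is one possible reading of it in Python; the names NAME_RE and normalize_name are hypothetical, and folding to upper case is just one way to compare names without regard to case.

```python
import re

# A letter, then up to 33 further letters, digits, periods, or hyphens.
NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9.-]{0,33}")


def normalize_name(name: str) -> str:
    """Return the upper-case form of a legal name, or raise ValueError."""
    if NAME_RE.fullmatch(name) is None:
        raise ValueError(f"not a legal name: {name!r}")
    # Names are not case sensitive, so fold to one case for comparison.
    return name.upper()
```

With this, normalize_name("title") and normalize_name("TITLE") yield the same string, while a name that starts with a digit or exceeds 34 characters is rejected.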
ATTRIBUTES
In a start tag, whitespace and attributes are allowed between the element
name and the closing delimiter. An attribute consists of a name, an equal
sign, and a value. Whitespace is allowed around the equal sign.
The value is specified in a string surrounded by single quotes or a string
surrounded by double quotes. (See: other tolerated forms @@)
The string is parsed like RCDATA (see below) to determine the attribute
value. This allows, for example, quote characters in attribute values to be
represented by character references.
The length of an attribute value (after parsing) is limited to 1024
characters.
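A minimal sketch of these attribute rules in Python follows. It accepts only the quoted forms, resolves decimal character references but not named entity references, and treats Python's chr() as the character numbering, all of which are simplifying assumptions. The names parse_attributes, ATTR_RE, and MAX_VALUE_LEN are hypothetical.

```python
import re

# name, optional whitespace around "=", then a single- or double-quoted value
ATTR_RE = re.compile(
    r"""\s*([A-Za-z][A-Za-z0-9.-]*)\s*=\s*(?:"([^"]*)"|'([^']*)')"""
)
CHAR_REF_RE = re.compile(r"&#([0-9]+);")
MAX_VALUE_LEN = 1024


def parse_attributes(spec: str) -> dict:
    """Parse the attribute part of a start tag into a name -> value dict."""
    attrs = {}
    pos = 0
    while pos < len(spec):
        m = ATTR_RE.match(spec, pos)
        if m is None:
            if spec[pos:].strip():
                raise ValueError(f"malformed attribute near {spec[pos:]!r}")
            break  # only trailing whitespace remains
        raw = m.group(2) if m.group(2) is not None else m.group(3)
        # Resolve character references, then apply the length limit.
        value = CHAR_REF_RE.sub(lambda r: chr(int(r.group(1))), raw)
        if len(value) > MAX_VALUE_LEN:
            raise ValueError("attribute value too long")
        attrs[m.group(1).upper()] = value
        pos = m.end()
    return attrs
```

For example, parse_attributes(' NAME = "a&#34;b"') yields a NAME attribute whose value contains a double quote, which could not be written directly inside a double-quoted string.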
Element Types
The name of a tag refers to an element type declaration in the HTML DTD. An
element type declaration associates an element name with
A list of attributes and their types and statuses
A content type (one of EMPTY, CDATA, RCDATA, ELEMENT, or MIXED) which
determines the syntax of the element's content
A content model, which specifies the pattern of nested elements and data
EMPTY ELEMENTS
Empty elements have the keyword EMPTY in their declaration. For example:

    <!ELEMENT NEXTID - O EMPTY>
    <!ATTLIST NEXTID N NUMBER #REQUIRED>

This means that the following:

    <nextid n="27">

is legal, but these others are not:

    <nextid>
    <nextid n="abc">
CHARACTER DATA
The keyword CDATA indicates that the content of an element is character
data. Character data is all the text up to the next end tag open
delimiter-in-context. For example:

    <!ELEMENT XMP - - CDATA>

specifies that the following text is a legal XMP element:

    <xmp>Here's a title. It looks like it has <tags> and <!--comments--> in it,
    but it does not. Even this </ is data.</xmp>

The string </ is only recognized as the opening delimiter of an end tag
when it is ``in context,'' that is, when it is followed by a letter.
However, as soon as the end tag open delimiter is recognized, it
terminates the CDATA content. The following is an error:

    <xmp>There is no way to represent </end> tags in CDATA
    </xmp>
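The delimiter-in-context rule for CDATA can be sketched in a few lines of Python. The function name cdata_content and the convention of passing the index just past the start tag are assumptions made for illustration.

```python
import re

# "</" counts as an end tag open delimiter only when a letter follows it.
ETAGO_IN_CONTEXT = re.compile(r"</(?=[A-Za-z])")


def cdata_content(text: str, start: int) -> str:
    """Return the CDATA content that begins at index `start`."""
    m = ETAGO_IN_CONTEXT.search(text, start)
    if m is None:
        raise ValueError("CDATA element is never closed")
    return text[start:m.start()]
```

Applied to the erroneous example above, the content ends just before </end>, so the parser then meets an end tag that does not match XMP.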
REPLACEABLE CHARACTER DATA
Elements with RCDATA content behave much like those with CDATA, except for
character references and entity references. Elements declared like:

    <!ELEMENT TITLE - - RCDATA>

can have any sequence of characters in their content.
Character References
To represent a character that would otherwise be recognized as markup, use a
character reference. The string &# signals a character reference when it
is followed by a letter or a digit. The delimiter is followed by the decimal
character number and a semicolon. For example:

    <title>You can even represent &#60;/end> tags in RCDATA
    </title>
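Resolving decimal character references can be sketched as a single substitution. The name resolve_char_refs is hypothetical, and the sketch assumes that Python's chr() numbering agrees with the document character set, which holds for the ASCII and ISO Latin 1 ranges.

```python
import re

# "&#", a decimal character number, and a closing semicolon.
CHAR_REF_RE = re.compile(r"&#([0-9]+);")


def resolve_char_refs(text: str) -> str:
    """Replace each decimal character reference with its character."""
    return CHAR_REF_RE.sub(lambda m: chr(int(m.group(1))), text)
```

Character number 60 is the less-than sign, so a reference to it lets RCDATA content show text that would otherwise open an end tag.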
Entity References
The HTML DTD declares entities for the less than, greater than, and
ampersand characters and each of the ISO Latin 1 characters so that you can
reference them by name rather than by number.
The string & signals an entity reference when it is followed by a letter
or a digit. The delimiter is followed by the entity name and a semicolon.
For example:

    Kurt G&ouml;del was a famous logician and mathematician.

Note: To be sure that a string of characters has no markup,
HTML writers should represent all occurrences of <,
>, and & by character or entity references.
ELEMENT CONTENT
Some elements have, instead of a keyword that states the type of content, a
content model, which tells what patterns of data and nested elements are
allowed. If the content model of an element does not include the symbol
#PCDATA, the content is element content.
Whitespace in element content is considered markup and ignored. Any
characters that are not markup, that is, data characters, are illegal.
For example:

    <!ELEMENT HEAD - - (TITLE? & ISINDEX? & NEXTID? & LINK*)>

declares an element that may be used as follows:

    <head>
    <isindex>
    <title>Head Example</title>
    </head>

But the following are illegal:

    <head> no data allowed! </head>

    <head><isindex><title>Two isindex tags</title><isindex></head>
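The HEAD content model is small enough to check by hand. The Python sketch below takes the list of child element names in document order; the name check_head is hypothetical. It assumes the usual reading of the SGML & connector, under which the members may come in any order but LINK* stands for a single contiguous run of LINK elements.

```python
from itertools import groupby

ALLOWED = {"TITLE", "ISINDEX", "NEXTID", "LINK"}


def check_head(children: list) -> None:
    """Raise ValueError unless children fit (TITLE? & ISINDEX? & NEXTID? & LINK*)."""
    seen = set()
    # groupby collapses each run of identical consecutive names.
    for name, run in groupby(n.upper() for n in children):
        count = len(list(run))
        if name not in ALLOWED:
            raise ValueError(f"{name} is not allowed in HEAD")
        if name in seen:
            raise ValueError(f"{name} appears in more than one place")
        if name != "LINK" and count > 1:
            raise ValueError(f"more than one {name} in HEAD")
        seen.add(name)
```

The legal example above passes as check_head(["isindex", "title"]), while the second illegal example fails because ISINDEX occurs twice. Data characters, as in the first illegal example, would have to be rejected before this check is reached.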
MIXED CONTENT
If the content model includes the symbol #PCDATA, the content of the element
is parsed as mixed content. For example:

    <!ELEMENT PRE - - (#PCDATA | A | B | I | U | P)+>
    <!ATTLIST PRE
            WIDTH NUMBER #implied
            >

This says that the PRE element contains one or more A, B, I, U, or P
elements or data characters. Here's an example of a PRE element:

    <pre>
    <b>NAME</b> cat -- concatenate<a href="terms.html#file">files</a>
    <b>EXAMPLE</b>
     cat &lt;xyz
    </pre>

The content of the above PRE element is:
A B element
The string `` cat -- concatenate''
An A element
The string ``\n''
Another B element
The string ``\n cat <xyz''
Comments and Other Markup
To include comments in an HTML document that will be ignored by the parser,
surround them with <!-- and -->. After the comment delimiter, all
text up to the next occurrence of -- is ignored. Hence comments cannot be
nested. Whitespace is allowed between the closing -- and >. (But not
between the opening <! and --.)
For example:

    <HEAD>
    <TITLE>HTML Guide: Recommended Usage</TITLE>
    <!-- $Id: recommended.html,v 1.3 93/01/06 18:38:11 connolly Exp $ -->
    </HEAD>

There are a few other SGML markup constructs that are deprecated or illegal.

Delimiter   Signals...
<?          Processing instruction. Terminated by >.
<![L        Marked section. Marked sections are deprecated. See
            the SGML standard for complete information.
<!L         Markup declaration. HTML defines no short reference
            maps, so these are errors. Terminated by >.
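The comment syntax described earlier in this section can be approximated with a regular expression. This is a sketch only: the names COMMENT_RE and strip_comments are hypothetical, and the pattern does not know about CDATA content, where comment delimiters are ordinary data.

```python
import re

# "<!--" opens the comment, the next "--" closes it, and whitespace may
# separate that "--" from the final ">". The body may not contain "--".
COMMENT_RE = re.compile(r"<!--(?:(?!--).)*--\s*>", re.DOTALL)


def strip_comments(text: str) -> str:
    """Remove every well-formed comment from text."""
    return COMMENT_RE.sub("", text)
```

Because the body stops at the first --, a comment nested inside another is not matched as a unit, which mirrors the statement that comments cannot be nested.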
Line Breaks
A line break character is considered markup (and ignored) if it is the first
or last piece of content in an element. This allows you to write either

    <PRE>some example text</pre>

or

    <pre>
    some example text
    </pre>

and these will be processed identically.
Also, a line that's not empty but contains no content will be ignored
altogether. For example, the element

    <pre>
    <!-- this line is ignored, including the linebreak character -->
    first line

    third line<!-- the following linebreak is content: -->
    fourth line<!-- this one's ignored cuz it's the last piece of content: -->
    </pre>

contains only the string first line\n\nthird line\nfourth line.
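The first of these rules, that a leading or trailing line break is not content, can be sketched as follows. The function name trim_line_breaks is hypothetical, the line break is assumed to be a single "\n", and the second rule (lines that hold only markup) is deliberately left out.

```python
def trim_line_breaks(content: str) -> str:
    """Drop one leading and one trailing line break from element content."""
    if content.startswith("\n"):
        content = content[1:]
    if content.endswith("\n"):
        content = content[:-1]
    return content
```

Both spellings of the first PRE example reduce to the same string under this function.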
Summary of Markup Signals
The following delimiters may signal markup, depending on context.
Delimiter   Signals
<!--        Comment
&#          Character reference
&           Entity reference
</          End tag
<!          Markup declaration
]]>         Marked section close (an error)
<           Start tag
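The table can be read as a lookup that also inspects the character after the delimiter. The Python sketch below is one such reading; the name classify_delimiter is hypothetical. The context tests for references and end tags follow the earlier sections, while the requirement that a start tag's < be followed by a letter is an assumption carried over from SGML.

```python
def _alnum(c: str) -> bool:
    return c.isascii() and c.isalnum()


def _alpha(c: str) -> bool:
    return c.isascii() and c.isalpha()


def classify_delimiter(text: str, i: int):
    """Name the markup signalled at text[i], or return None for data."""
    rest = text[i:i + 5]
    # Longer delimiters are tested before their prefixes.
    if rest.startswith("<!--"):
        return "comment"
    if rest.startswith("&#") and _alnum(rest[2:3]):
        return "character reference"
    if rest.startswith("&") and _alnum(rest[1:2]):
        return "entity reference"
    if rest.startswith("</") and _alpha(rest[2:3]):
        return "end tag"
    if rest.startswith("<!"):
        return "markup declaration"
    if rest.startswith("]]>"):
        return "marked section close"
    if rest.startswith("<") and _alpha(rest[1:2]):
        return "start tag"
    return None
```

A tokenizer could call this at each < , & , or ] and treat a None result as ordinary data.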
REFERENCES
ISO 8879:1986, Information Processing -- Text and Office
Systems -- Standard Generalized Markup Language (SGML)
sgmls, an SGML parser by James Clark <jjc@jclark.com>,
derived from the ARCSGML parser materials which were
written by Charles F. Goldfarb. The source is
available on the ifi.uio.no FTP server in the
directory /pub/SGML/SGMLS.
WWW
URL
HTML DTD
<!SGML  "ISO 8879:1986"
--
        Document Type Definition for the HyperText Markup Language as used
        by the World Wide Web application (HTML DTD).

        NOTE: This is a definition of HTML with respect to SGML, and
        assumes an understanding of SGML terms. For a description of HTML
        in layman's terms, see
        "HTML: A Representation for Nodes in the World Wide Web"
        by Dan Connolly. aka
        http://info.cern.ch/hypertext/WWW/MarkUp/MarkUp.html
        by <connolly@convex.com>
--

CHARSET
         BASESET  "ISO 646:1983//CHARSET
                   International Reference Version (IRV)//ESC 2/5 4/0"
         DESCSET  0   9   UNUSED
                  9   2   9
                  11  2   UNUSED
                  13  1   13
                  14  18  UNUSED
                  32  95  32
                  127 1   UNUSED
         BASESET  "ISO Registration Number 100//CHARSET
                   ECMA-94 Right Part of Latin Alphabet Nr. 1//ESC 2/13 4/1"
         DESCSET  128 32  UNUSED
                  160 95  32
                  255 1   UNUSED

CAPACITY SGMLREF
         TOTALCAP 150000
         GRPCAP   150000

SCOPE    DOCUMENT

SYNTAX
         SHUNCHAR CONTROLS 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17
                  18 19 20 21 22 23 24 25 26 27 28 29 30 31 127 255
         BASESET  "ISO 646:1983//CHARSET
                   International Reference Version (IRV)//ESC 2/5 4/0"
         DESCSET  0 128 0
         FUNCTION RE          13
                  RS          10
                  SPACE       32
                  TAB SEPCHAR 9
         NAMING   LCNMSTRT ""
                  UCNMSTRT ""
                  LCNMCHAR ".-"
                  UCNMCHAR ".-"
                  NAMECASE GENERAL YES
                           ENTITY  NO
         DELIM    GENERAL  SGMLREF
                  SHORTREF SGMLREF
         NAMES    SGMLREF
         QUANTITY SGMLREF
                  NAMELEN  34
                  TAGLVL   100
                  LITLEN   1024
                  GRPGTCNT 150
                  GRPCNT   64

FEATURES
  MINIMIZE
    DATATAG  NO
    OMITTAG  NO
    RANK     NO
    SHORTTAG NO
  LINK
    SIMPLE   NO
    IMPLICIT NO
    EXPLICIT NO
  OTHER
    CONCUR   NO
    SUBDOC   NO
    FORMAL   YES
  APPINFO    NONE
>

<!DOCTYPE HTML
[

<!-- $Id: html.dtd,v 1.4 93/01/20 20:56:08 connolly Exp $ -->

<!-- Regarding clause 6.1, SGML Document:

        [1] SGML document = SGML document entity,
            (SGML subdocument entity |
            SGML text entity | non-SGML data entity)*

        The role of SGML document entity is filled by this DTD,
        followed by the conventional HTML data stream.
-->

<!-- DTD definitions -->

<!ENTITY % heading "H1|H2|H3|H4|H5|H6" >
<!ENTITY % list "UL|OL|DIR|MENU">
<!ENTITY % literal "XMP|LISTING">

<!ENTITY % headelement
         "TITLE | NEXTID | ISINDEX" >

<!ENTITY % bodyelement
         "P | %heading | %list | DL | HEADERS | ADDRESS | PRE | BLOCKQUOTE
        | %literal">

<!ENTITY % oldstyle "%headelement | %bodyelement | #PCDATA">

<!ENTITY % URL "CDATA"
        -- The term URL means a CDATA attribute
           whose value is a Universal Resource Locator,
           as defined in ftp://info.cern.ch/pub/www/doc/url3.txt
        -->

<!ENTITY % linkattributes
        "NAME NMTOKEN #IMPLIED
        HREF %URL; #IMPLIED
        TYPE NAME #IMPLIED -- type of relationship to referent data:
                              PARENT CHILD, SIBLING, NEXT, TOP,
                              DEFINITION, UPDATE, ORIGINAL etc. --
        URN CDATA #IMPLIED -- universal resource number. unique doc id --
        TITLE CDATA #IMPLIED -- advisory only --
        METHODS NAMES #IMPLIED -- supported methods of the object:
                                  TEXTSEARCH, GET, HEAD, ... --
        ">

<!-- Document Element -->

<!ELEMENT HTML O O ((HEAD | BODY | %oldstyle)*, PLAINTEXT?)>

<!ELEMENT HEAD - - (TITLE? & ISINDEX? & NEXTID? & LINK*)>

<!ELEMENT TITLE - - RCDATA
          -- The TITLE element is not considered part of the flow of text.
             It should be displayed, for example as the page header or
             window title.
          -->

<!ELEMENT ISINDEX - O EMPTY
          -- WWW clients should offer the option to perform a search on
             documents containing ISINDEX.
          -->

<!ELEMENT NEXTID - O EMPTY>
<!ATTLIST NEXTID N NUMBER #REQUIRED
          -- The number should be the highest number that appears in
             any NAME attribute in the document.
          -->

<!ELEMENT LINK - O EMPTY>
<!ATTLIST LINK
        %linkattributes>

<!ENTITY % inline "EM | TT | STRONG | B | I | U |
                   CODE | SAMP | KBD | KEY | VAR | DFN | CITE "
        >

<!ELEMENT (%inline;) - - (#PCDATA)>

<!ENTITY % hypertext "#PCDATA | %inline; | A">

<!ELEMENT BODY - - (%bodyelement|%hypertext;)*>

<!ELEMENT A - - (#PCDATA)>
<!ATTLIST A
        %linkattributes;
        >

<!ELEMENT P - O EMPTY -- separates paragraphs -->
<!ELEMENT (%heading) - - (%hypertext;)+>

<!ELEMENT DL - - (DT | DD | P | %hypertext;)*>
<!-- Content should match ((DT,(%hypertext;)+)+,(DD,(%hypertext;)+))
     But mixed content is messy.
  -->
<!ATTLIST DL STYLE NAME #IMPLIED -- COMPACT, etc.-- >

<!ELEMENT DT - O EMPTY>
<!ELEMENT DD - O EMPTY>

<!ELEMENT (UL|OL) - - (%hypertext;|LI|P)+>
<!ELEMENT (DIR|MENU) - - (%hypertext;|LI)+>
<!-- Content should match ((LI,(%hypertext;)+)+)
     But mixed content is messy.
  -->

<!ELEMENT LI - O EMPTY>

<!ELEMENT BLOCKQUOTE - - (%hypertext;|P)+
        -- for quoting some other source -->
<!ATTLIST BLOCKQUOTE SOURCE CDATA #IMPLIED -- URL of source -- >

<!ELEMENT ADDRESS - - (%hypertext;|P)+>

<!ELEMENT PRE - - (#PCDATA | A | B | I | U | P)+>
<!ATTLIST PRE
        WIDTH NUMBER #implied
        >

<!-- Mnemonic character entities. -->
<!ENTITY AElig  "Æ" -- capital AE diphthong (ligature) -->
<!ENTITY Aacute "Á" -- capital A, acute accent -->
<!ENTITY Acirc  "Â" -- capital A, circumflex accent -->
<!ENTITY Agrave "À" -- capital A, grave accent -->
<!ENTITY Aring  "Å" -- capital A, ring -->
<!ENTITY Atilde "Ã" -- capital A, tilde -->
<!ENTITY Auml   "Ä" -- capital A, dieresis or umlaut mark -->
<!ENTITY Ccedil "Ç" -- capital C, cedilla -->
<!ENTITY ETH    "Ð" -- capital Eth, Icelandic -->
<!ENTITY Eacute "É" -- capital E, acute accent -->
<!ENTITY Ecirc  "Ê" -- capital E, circumflex accent -->
<!ENTITY Egrave "È" -- capital E, grave accent -->
<!ENTITY Euml   "Ë" -- capital E, dieresis or umlaut mark -->
<!ENTITY Iacute "Í" -- capital I, acute accent -->
<!ENTITY Icirc  "Î" -- capital I, circumflex accent -->
<!ENTITY Igrave "Ì" -- capital I, grave accent -->
<!ENTITY Iuml   "Ï" -- capital I, dieresis or umlaut mark -->
<!ENTITY Ntilde "Ñ" -- capital N, tilde -->
<!ENTITY Oacute "Ó" -- capital O, acute accent -->
<!ENTITY Ocirc  "Ô" -- capital O, circumflex accent -->
<!ENTITY Ograve "Ò" -- capital O, grave accent -->
<!ENTITY Oslash "Ø" -- capital O, slash -->
<!ENTITY Otilde "Õ" -- capital O, tilde -->
<!ENTITY Ouml   "Ö" -- capital O, dieresis or umlaut mark -->
<!ENTITY THORN  "Þ" -- capital THORN, Icelandic -->
<!ENTITY Uacute "Ú" -- capital U, acute accent -->
<!ENTITY Ucirc  "Û" -- capital U, circumflex accent -->
<!ENTITY Ugrave "Ù" -- capital U, grave accent -->
<!ENTITY Uuml   "Ü" -- capital U, dieresis or umlaut mark -->
<!ENTITY Yacute "Ý" -- capital Y, acute accent -->
<!ENTITY aacute "á" -- small a, acute accent -->
<!ENTITY acirc  "â" -- small a, circumflex accent -->
<!ENTITY aelig  "æ" -- small ae diphthong (ligature) -->
<!ENTITY agrave "à" -- small a, grave accent -->
<!ENTITY amp    "&" -- ampersand -->
<!ENTITY aring  "å" -- small a, ring -->
<!ENTITY atilde "ã" -- small a, tilde -->
<!ENTITY auml   "ä" -- small a, dieresis or umlaut mark -->
<!ENTITY ccedil "ç" -- small c, cedilla -->
<!ENTITY eacute "é" -- small e, acute accent -->
<!ENTITY ecirc  "ê" -- small e, circumflex accent -->
<!ENTITY egrave "è" -- small e, grave accent -->
<!ENTITY eth    "ð" -- small eth, Icelandic -->
<!ENTITY euml   "ë" -- small e, dieresis or umlaut mark -->
<!ENTITY gt     ">" -- greater than -->
<!ENTITY iacute "í" -- small i, acute accent -->
<!ENTITY icirc  "î" -- small i, circumflex accent -->
<!ENTITY igrave "ì" -- small i, grave accent -->
<!ENTITY iuml   "ï" -- small i, dieresis or umlaut mark -->
<!ENTITY lt     "<" -- less than -->
<!ENTITY ntilde "ñ" -- small n, tilde -->
<!ENTITY oacute "ó" -- small o, acute accent -->
<!ENTITY ocirc  "ô" -- small o, circumflex accent -->
<!ENTITY ograve "ò" -- small o, grave accent -->
<!ENTITY oslash "ø" -- small o, slash -->
<!ENTITY otilde "õ" -- small o, tilde -->
<!ENTITY ouml   "ö" -- small o, dieresis or umlaut mark -->
<!ENTITY szlig  "ß" -- small sharp s, German (sz ligature) -->
<!ENTITY thorn  "þ" -- small thorn, Icelandic -->
<!ENTITY uacute "ú" -- small u, acute accent -->
<!ENTITY ucirc  "û" -- small u, circumflex accent -->
<!ENTITY ugrave "ù" -- small u, grave accent -->
<!ENTITY uuml   "ü" -- small u, dieresis or umlaut mark -->
<!ENTITY yacute "ý" -- small y, acute accent -->
<!ENTITY yuml   "ÿ" -- small y, dieresis or umlaut mark -->
<!-- deprecated elements -->

<!ELEMENT (%literal) - - CDATA>
<!ELEMENT PLAINTEXT - O EMPTY>

<!-- Local Variables: -->
<!-- mode: sgml -->
<!-- compile-command: "sgmls -s -p " -->
<!-- end: -->
]>